Feature: read_parquet_mergetree #13

Merged on Oct 11, 2024 (18 commits)

akvlad
Collaborator

@akvlad akvlad commented Sep 30, 2024

read_parquet_mergetree

Description

The read_parquet_mergetree chsql function provides a familiar interface for ClickHouse users by emulating aspects of the MergeTree engine strategy. Its primary purpose is to merge multiple Parquet files efficiently using a specified primary sort key, without consuming excessive memory, while producing a sorted file that supports fast range queries.

TL;DR: A memory-efficient Parquet file merge/compact feature with sorting capabilities.
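
To illustrate why a merge by sort key can stay memory-light, here is a minimal Python sketch of a streaming k-way merge. It is not the extension's implementation. It assumes each input run is already sorted by the key, which the description above does not state, and the name merge_sorted_runs and the sample rows are hypothetical.

```python
import heapq
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def merge_sorted_runs(runs: List[Iterable[Row]], sort_key: str) -> Iterator[Row]:
    """Lazily merge runs that are each already sorted by sort_key (assumption)."""
    # heapq.merge keeps only one pending row per run in its heap, so the
    # working set grows with the number of runs, not the number of rows.
    return heapq.merge(*runs, key=itemgetter(sort_key))


# Hypothetical sample data standing in for rows streamed from two files.
run_a = [{"some_key": 1, "v": "a"}, {"some_key": 4, "v": "b"}]
run_b = [{"some_key": 2, "v": "c"}, {"some_key": 3, "v": "d"}]

merged = list(merge_sorted_runs([iter(run_a), iter(run_b)], "some_key"))
merged_keys = [row["some_key"] for row in merged]
print(merged_keys)
```

By contrast, a plain ORDER BY over the union of all files generally has to buffer or spill the whole dataset before it can emit the first row.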

Syntax

COPY (SELECT * FROM read_parquet_mergetree([PARQUET_FILE_ARRAY], {PRIMARY_SORT_KEY})) TO '{MERGED.PARQUET}'

Features

  • Merge data from multiple files, similar to how ClickHouse combines data from different parts
  • Use a specified sort key to order data, analogous to the primary key in ClickHouse MergeTree tables
  • Maintain sorted order within the merged dataset, facilitating fast range queries
  • Support glob patterns and wildcards in file array
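
One reason sorted output helps range queries is that Parquet stores per-row-group min/max statistics, which readers can use to skip row groups that cannot match a filter. The sketch below models that idea in plain Python. The RowGroup type, the groups_to_scan helper, and the numbers are hypothetical and are not taken from this PR.

```python
from typing import List, NamedTuple


class RowGroup(NamedTuple):
    """Hypothetical stand-in for a row group's statistics on the sort key."""
    min_key: int
    max_key: int
    num_rows: int


def groups_to_scan(groups: List[RowGroup], lo: int, hi: int) -> List[int]:
    """Return indexes of row groups whose [min_key, max_key] overlaps [lo, hi]."""
    return [i for i, g in enumerate(groups) if g.max_key >= lo and g.min_key <= hi]


# Sorted file: key ranges of row groups do not overlap.
sorted_file = [RowGroup(0, 99, 100), RowGroup(100, 199, 100), RowGroup(200, 299, 100)]
# Unsorted file: every row group spans almost the whole key range.
unsorted_file = [RowGroup(0, 290, 100), RowGroup(5, 299, 100), RowGroup(1, 250, 100)]

sorted_hits = groups_to_scan(sorted_file, 120, 130)
unsorted_hits = groups_to_scan(unsorted_file, 120, 130)
print(sorted_hits, unsorted_hits)
```

In the sorted layout only one row group overlaps the requested range, while in the unsorted layout every row group has to be read.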

Parameters

  • PARQUET_FILE_ARRAY: An array of Parquet file paths to merge; glob patterns and wildcards are supported
  • PRIMARY_SORT_KEY: Specifies the column(s) used as the primary sort key for merging and ordering data

Benchmark

COPY (SELECT * FROM read_parquet(['/folder/*.parquet']) ORDER BY some_key) TO 'sorted.parquet';
-- USAGE: ~64GB RAM
COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key')) TO 'sorted.parquet';
-- USAGE: ~800MB RAM

@lmangani changed the title from "WIP Feature/parquet ordered scan" to "WIP Feature: read_mergetree" Sep 30, 2024
@lmangani
Collaborator

lmangani commented Oct 1, 2024

Hey @carlopi, any chance you or someone on the team knows how to get around the Windows build error? 🙏

@carlopi

carlopi commented Oct 1, 2024

Can you try to reduce the diff?

Or try to copy the setup of extensions like duckdb_delta.

@akvlad
Collaborator Author

akvlad commented Oct 1, 2024

@carlopi Is it enough if I tell you that the real change is only in the file https://github.com/lmangani/duckdb-extension-clickhouse-sql/pull/13/files#diff-c5bffd6b887e2ced50224f44652dab784c9c7f7ab8c46a390410cc58490391ed ?

The other changes are just internal insignificant file moves.

Or do you need a separate PR with the function implementation?

@carlopi

carlopi commented Oct 1, 2024

Then it's likely either a #pragma once is needed in chsql.hpp or maybe Chsql::Name & co can stay in the main header with the main extension mechanics, and the rest of the function registration should be moved to a secondary header.

@akvlad
Collaborator Author

akvlad commented Oct 1, 2024

@carlopi Aaah. It's about the Windows build problem.

From the MSVC++ linker logs I see that the linker somehow wants to link the chsql_extension.obj file more than once:

chsql_extension.lib(chsql_extension.obj) : error LNK2005: "public: virtual void __cdecl duckdb::ChsqlExtension::Load(class duckdb::DuckDB &)" (?Load@ChsqlExtension@duckdb@@UEAAXAEAVDuckDB@2@@Z) already defined in chsql_extension.obj [D:\a\duckdb-extension-clickhouse-sql\duckdb-extension-clickhouse-sql\build\release\extension\chsql\chsql_loadable_extension.vcxproj]
chsql_extension.lib(chsql_extension.obj )error LNK2005:  .... already defined in chsql_extension.obj

I have no idea why it wants to link the same .obj twice. Have you encountered a similar problem anywhere?

@lmangani
Collaborator

lmangani commented Oct 7, 2024

Screenshot of @akvlad kicking the Windows builder where it hurts 😄
[image: screenshot]

@lmangani changed the title from "WIP Feature: read_mergetree" to "WIP Feature: read_parquet_mergetree" Oct 8, 2024
@lmangani changed the title from "WIP Feature: read_parquet_mergetree" to "Feature: read_parquet_mergetree" Oct 8, 2024
@lmangani
Collaborator

Amazing work @akvlad, let's merge and proceed with some field testing 🎉

@lmangani lmangani merged commit 751d1a4 into main Oct 11, 2024
46 checks passed
@lmangani lmangani deleted the feature/parquet_ordered_scan branch October 11, 2024 16:34
3 participants